Introduction to Open Data Science - Course Project

About the project

This is a course on Open Data Science.

Course description

More information can be found [here](https://studies.helsinki.fi/courses/course-implementation/hy-opt-cur-2324-6bc9901d-e80d-4ba7-8bce-829e42c15521/PHD-302)

date()
## [1] "Mon Dec 11 19:44:43 2023"

So far I have completed only the reading and exercises from chapters 1 and 2, and I’m panicking a little. It was difficult for me to understand how the course starts, and I was confused about what to complete. I also didn’t remember how much work a 5-credit course can be :) I expect to learn how to use GitHub and to understand R code. I heard about the course from a university email.

The crash course is excellent and easy to follow and understand, but I should have had more time to go through it, because everything was new to me. So far I have only completed chapters 1 and 2.

GitHub

My GitHub repository is here


Assignment 2: Data Analysis exercise

This week I completed Excercise2.Rmd, the data wrangling exercise and the data analysis, where I learned some data wrangling: modifying data sets, extracting and combining desired variables into a new data set, checking the data, and doing some statistical tests and visualizations. I also spent a lot of time learning new words and concepts to better understand these statistical analyses.

date()
## [1] "Mon Dec 11 19:44:43 2023"

Getting to know the data

  • First, let’s read the data into a data frame in R using the read.csv() command and name it students2014.
  • Then we can check the dimensions with the dim() command and the structure with the str() command.

R code

students2014 <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt")

dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
  • The data has 166 rows (observations) and 7 columns (variables).
  • The age and points variables are integers (int), i.e. numerical values. The gender variable is a character (chr), a categorical variable with two categories: female (F) or male (M). The rest of the variables (attitude, deep, stra and surf) are numerical (num) variables with different scales.

Data visualization

  • First let’s check the summaries of the variables in the data using the summary() command.
  • Next let’s get a graphical overview of the data by creating a scatter plot with the ggplot2 library.

R code

# students2014 is available

# Access the ggplot2 library
library(ggplot2)

# Summary of numeric variables
summary(students2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# initialize plot with data and aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points))

# define the visualization type (points)
p2 <- p1 + geom_point()

# draw the plot
p2

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")

# draw the plot!
p4
## `geom_smooth()` using formula = 'y ~ x'

Description and interpretation of the outputs

  • summary() shows a summary of the numeric variables in the data set: minimum, maximum, median, mean and quartiles.
  • p2 is a scatterplot where attitude is on the x-axis and points is on the y-axis.
  • A linear regression line is then added in p3 to visualize the relationship between attitude and points.
  • Finally a title (“Student’s attitude versus exam points”) is added in p4.
  • The distribution of the exam points is quite variable; based on the mean, students scored 22.72 points on average in the exam. The slope of the regression line is upward, so we can interpret that there is a positive relationship between exam points and attitude –> higher attitudes are associated with higher exam points.

Statistical test to interpret a statistically significant relationship between the explanatory variables and the target variable

  • Let’s choose three variables (age, attitude, and deep) as explanatory variables and points as the target variable, and fit a linear regression model with them.

R code

# students2014 is available

# Access the ggplot2 library
library(ggplot2)

model <- lm(points ~ age + attitude + deep, data = students2014)

# Summary of the fitted model
summary(model)
## 
## Call:
## lm(formula = points ~ age + attitude + deep, data = students2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.60773    3.38966   4.605 8.32e-06 ***
## age         -0.07716    0.05322  -1.450    0.149    
## attitude     3.59413    0.56959   6.310 2.56e-09 ***
## deep        -0.60275    0.75031  -0.803    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08
model2 <- lm(points ~ attitude, data = students2014)

# Summary of the fitted model2
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Explanation of the relationship between the chosen explanatory variables and the target variable and the multiple R-squared of the model

  • summary() shows a summary of the fitted model, including coefficients, standard errors, t-values, and p-values.
  • If a coefficient is positive it indicates a positive relationship, and if it is negative, a negative relationship between the target variable (points) and the explanatory variable (age, attitude or deep).
  • The standard errors measure the variability of the estimated coefficients.
  • The smaller the standard errors, the more precise the estimates usually are.
  • The t-value is calculated by dividing the coefficient by its standard error. The higher the t-value, the more likely the result is significant.
  • If the p-value is below the commonly used threshold of 0.05, the variable can be assumed to be statistically significant. The smaller the p-value, the stronger the evidence against the null hypothesis. In the first fitted model the p-values for age (0.149) and deep (0.423) are above 0.05, while the p-value for attitude is very low (2.56e-09) –> so the age and deep variables were removed and only attitude was kept in model2.
  • The multiple R-squared of model2 is 0.1906. This means that approximately 19.06% of the variability in exam points is explained by the attitude variable.
  • Lastly, there is a lot of variability in exam points, but with these factors alone it’s impossible to know why –> some other factors in the original data set could explain this variability better.
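As a sanity check, the t-value arithmetic from the bullets above can be reproduced with the rounded numbers reported in the summary of model2 (the constants below are copied from that output, so the result matches the reported value only up to rounding):

```r
# t-value = coefficient / standard error, using the rounded values
# reported for attitude in summary(model2)
estimate  <- 3.5255
std_error <- 0.5674
t_value   <- estimate / std_error
round(t_value, 2)   # close to the reported 6.214
```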

Diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

R code

# students2014 and model2 is available

# Place the following diagnostic plots in one figure with a 2x2 plot layout
par(mfrow = c(2,2))

# draw diagnostic plots using the plot() function; choose plots 1, 2 and 5
plot(model2, which = c(1, 2, 5))

Explanation of the model assumptions and interpretation of their validity based on the diagnostic plots

  1. Residuals vs Fitted values: If the model assumptions hold, the residuals should be scattered randomly around zero, with no clear pattern with respect to the fitted values. If the plot shows something else (like a curve or a funnel shape), it indicates a broken linearity or constant-variance assumption. –> Conclusions: The points are scattered and no specific pattern is shown. The linearity assumption is valid.

  2. Normal QQ-plot: The normal QQ-plot is used to check whether or not the residuals follow a normal distribution. If they do, the points should form a straight line; if the points deviate from the line, the residuals are less likely to be normally distributed. –> Conclusions: The points follow the diagonal line pretty well, but some deviations can be seen in the tails. Still, it’s reasonable to assume that the errors are normally distributed.

  3. Residuals vs Leverage: This plot shows each observation as a single point. The x-axis shows the leverage of each point and the y-axis its standardized residual. Leverage measures how much the coefficients of the regression model would change if a particular observation were removed from the data. The standardized residual is the standardized difference between the predicted and the actual value of an observation. –> Conclusions: The residuals are spread relatively evenly and no highly influential points can be observed.
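The quantities behind these three diagnostic plots can also be extracted directly in R; a minimal sketch on made-up data (not the course data):

```r
# Simulate a small linear relationship and fit a model
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
m <- lm(y ~ x)

fitted_vals <- fitted(m)      # x-axis of the Residuals vs Fitted plot
resids      <- resid(m)       # y-axis of the Residuals vs Fitted plot
std_resid   <- rstandard(m)   # standardized residuals used in the QQ-plot
leverage    <- hatvalues(m)   # leverage of each observation (Residuals vs Leverage)

# The hat values always sum to the number of estimated coefficients (here 2)
sum(leverage)
```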

Done!


Assignment 3: Data Analysis exercise

This week the aim is to statistically investigate a data set related to student performance and alcohol consumption in secondary education in two Portuguese schools.

date()
## [1] "Mon Dec 11 19:44:45 2023"

The data set is available here

Reading data into R

R code

# Read the data table into R from the linked file and save it as "alc", using a comma as the column separator. The first row includes the column names.
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)

# Load required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# print out the names of the variables in the data
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data set was downloaded from the UCI Machine Learning Repository and relates to student performance in secondary education in two Portuguese schools. The data originates from school reports and questionnaires and contains variables such as age, gender, family background, study time, and various academic performance metrics (e.g., grades). The data set also contains information about the students’ alcohol consumption during weekdays and weekends.

Associations of alcohol consumption with study time, sex and age

Let’s choose the following variables to study their relationship with alcohol consumption: study time, sex and age.

First let’s look at the distribution of alcohol consumption (‘Walc’ for weekends and ‘Dalc’ for weekdays) and its relationship with the variable ‘studytime’.

R code

# Explore the dataset structure
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
# data has 370 rows and 35 columns

# Summary statistics of relevant variables
summary(alc$Walc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.295   3.000   5.000
summary(alc$Dalc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.484   2.000   5.000
summary(alc$studytime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.043   2.000   4.000
# Cross-tabulation of studytime and Walc
cross_tab_walc_studytime <- table(alc$studytime, alc$Walc)
print(cross_tab_walc_studytime)
##    
##      1  2  3  4  5
##   1 22 24 19 20 13
##   2 72 38 40 24 11
##   3 30 14 13  2  1
##   4 17  4  3  1  2

Because the mean value of Walc is higher than that of Dalc, let’s focus on weekend alcohol consumption.

R code

# Bar plot of average Walc for each level of studytime
alc_studytime_bar_plot <- ggplot(alc, aes(x = as.factor(studytime), y = Walc)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
  labs(title = "Average Weekend Alcohol Consumption by Study Time",
       x = "Study Time",
       y = "Average Weekend Alcohol Consumption")

print(alc_studytime_bar_plot)

R code

# Box plot of Walc by studytime
alc_studytime_box_plot <- ggplot(alc, aes(x = as.factor(studytime), y = Walc)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Weekend Alcohol Consumption by Study Time",
       x = "Study Time",
       y = "Weekend Alcohol Consumption")

print(alc_studytime_box_plot)

Alcohol consumption varies across the different levels of study time, and it seems that students with higher study times have lower average alcohol consumption.
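The averages drawn in the bar plot can also be computed directly with aggregate(); a toy illustration with made-up data in the same shape as alc:

```r
# Made-up mini data set: study time level and weekend alcohol use
toy <- data.frame(studytime = c(1, 1, 2, 2, 3, 3),
                  Walc      = c(4, 5, 3, 2, 1, 2))

# Mean Walc per studytime level
aggregate(Walc ~ studytime, data = toy, FUN = mean)
#   studytime Walc
# 1         1  4.5
# 2         2  2.5
# 3         3  1.5
```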

Then let’s look at the distribution of alcohol consumption (‘Walc’ for weekends and ‘Dalc’ for weekdays) and its relationship with the variable ‘sex’.

R code

# Summary statistics of relevant variables
summary(alc$sex)
##    Length     Class      Mode 
##       370 character character
# Cross-tabulation of sex and Walc
cross_tab_sex <- table(alc$sex, alc$Walc)
print(cross_tab_sex)
##    
##      1  2  3  4  5
##   F 87 48 43 13  4
##   M 54 32 32 34 23

R code

# Bar plot of average Walc for each level of sex
alc_sex_bar_plot <- ggplot(alc, aes(x = as.factor(sex), y = Walc)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
  labs(title = "Average Weekend Alcohol Consumption by sex",
       x = "sex",
       y = "Average Weekend Alcohol Consumption")

print(alc_sex_bar_plot)

R code

# Box plot of Walc by sex
alc_sex_box_plot <- ggplot(alc, aes(x = as.factor(sex), y = Walc)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Weekend Alcohol Consumption by sex",
       x = "sex",
       y = "Weekend Alcohol Consumption")

print(alc_sex_box_plot)

Average weekend alcohol consumption seems to be slightly higher for males than for females.

Lastly, let’s look at the distribution of alcohol consumption (‘Walc’ for weekends) and its relationship with the variable ‘age’.

R code

# Summary statistics of age
summary(alc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   16.00   17.00   16.58   17.00   22.00
# Cross-tabulation of age and Walc
cross_tab_age <- table(alc$age, alc$Walc)
print(cross_tab_age)
##     
##       1  2  3  4  5
##   15 46 12 13  7  3
##   16 41 23 15 15  8
##   17 25 23 27 14  8
##   18 25 19 15 11  7
##   19  3  3  5  0  0
##   20  1  0  0  0  0
##   22  0  0  0  0  1

R code

# Bar plot of average Walc for each level of age
alc_age_bar_plot <- ggplot(alc, aes(x = as.factor(age), y = Walc)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
  labs(title = "Average Weekend Alcohol Consumption by age",
       x = "age",
       y = "Average Weekend Alcohol Consumption")

print(alc_age_bar_plot)

R code

# Box plot of Walc by age
alc_age_box_plot <- ggplot(alc, aes(x = as.factor(age), y = Walc)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Weekend Alcohol Consumption by age",
       x = "age",
       y = "Weekend Alcohol Consumption")

print(alc_age_box_plot)

Average weekend alcohol consumption seems to increase slightly from 15- to 17-year-old students, then decrease until the students are 20 years old, and increase a lot again at age 22. Note, however, that the cross-tabulation shows only one student each at ages 20 and 22, so those averages are unreliable. After inspecting the box plot, it also seems that the data has some outliers in the age groups 15, 17, 20 and 22 that can influence the overall patterns observed in the relationship between age and alcohol consumption. To investigate this, let’s calculate the correlation coefficient between age and alcohol consumption.

Correlation coefficient between age and alcohol consumption

R code

# Calculate the correlation coefficient between age and Walc
correlation_walc_age <- cor(alc$age, alc$Walc)

# Print the correlation coefficient
print(correlation_walc_age)
## [1] 0.1551427

The correlation coefficient is a value between -1 and 1. A positive value indicates a positive correlation and a negative value a negative correlation. A value of 1 or -1 indicates a perfect correlation, and 0 indicates no correlation.

The correlation coefficient between age and Walc is 0.1551427. This value is quite close to 0 but still positive, so age has a very minor but positive correlation with alcohol consumption during the weekend.
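The extreme cases of the correlation coefficient are easy to verify on toy vectors (a quick sketch, not the course data):

```r
x <- 1:10
cor(x, 2 * x)   # exactly linear increase –> 1 (perfect positive correlation)
cor(x, -x)      # exactly linear decrease –> -1 (perfect negative correlation)
```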

Generalized Linear Models

Let’s fit a logistic regression model using high_use as the binary outcome variable and study time, sex, and age as the predictor variables. The family = “binomial” argument specifies logistic regression.

R code

# Fit logistic regression model
logistic_model <- glm(high_use ~ studytime + sex + age, data = alc, family = "binomial")

# Summarize the fitted model
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ studytime + sex + age, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -3.6554     1.7142  -2.132  0.03297 * 
## studytime    -0.4972     0.1601  -3.105  0.00190 **
## sexM          0.7072     0.2474   2.858  0.00426 **
## age           0.2054     0.1008   2.037  0.04161 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 421.24  on 366  degrees of freedom
## AIC: 429.24
## 
## Number of Fisher Scoring iterations: 4

Based on this logistic regression model it seems that each predictor variable is associated with the “high_use” of alcohol. The p-values suggest that study time (p-value of 0.00190), sex (p-value of 0.00426), and age (p-value of 0.04161) are statistically significant predictors of alcohol consumption.

The coefficients for sexM and age are positive. So as age increases, the log odds of being in the “high_use” category (high alcohol consumption) increase, and males have higher odds of high alcohol consumption than females. The coefficient for studytime is negative, indicating that as study time increases, the log odds of being in the “high_use” category decrease.
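To make the log-odds scale concrete, a predicted probability can be computed by hand from the (rounded) coefficients reported in the summary above; the student below (male, age 17, study time 2) is a made-up illustration, not an observation from the data:

```r
# Linear predictor (log odds) from the rounded summary coefficients:
# intercept + studytime-coef * 2 + sexM-coef * 1 + age-coef * 17
log_odds <- -3.6554 + (-0.4972) * 2 + 0.7072 * 1 + 0.2054 * 17

# The logistic function converts log odds to a probability
prob <- 1 / (1 + exp(-log_odds))
round(prob, 2)   # roughly 0.39
```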

R code

# Extract odds ratios and confidence intervals
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- confint(logistic_model)
## Waiting for profiling to be done...
# Print odds ratios and confidence intervals
print("Odds Ratios:")
## [1] "Odds Ratios:"
print(odds_ratios)
## (Intercept)   studytime        sexM         age 
##  0.02585034  0.60825171  2.02832763  1.22807258
print("Confidence Intervals:")
## [1] "Confidence Intervals:"
print(conf_intervals)
##                    2.5 %     97.5 %
## (Intercept) -7.063114273 -0.3259619
## studytime   -0.821067712 -0.1918567
## sexM         0.224462537  1.1962918
## age          0.009734844  0.4061281

Odds ratio for study time is 0.60825171 –> for each one-unit increase in study time, the odds of being in the “high_use” category decrease by approximately 39.2% (1 - 0.6082).

Odds ratio for sex (sexM) is 2.02832763 –> males have approximately 2.03 times the odds of being in the “high_use” category compared to females.

Odds ratio for age is 1.22807258 –> for each one-unit increase in age, the odds of being in the “high_use” category increase by approximately 22.8% (1.2281 - 1).
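These odds ratios are simply the exponentiated model coefficients; recomputing them from the (rounded) estimates in the summary above reproduces the reported values up to rounding:

```r
# exp() maps log-odds coefficients to odds ratios
exp(c(studytime = -0.4972, sexM = 0.7072, age = 0.2054))
# roughly 0.608, 2.028, 1.228
```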


If a confidence interval includes 0 (for a coefficient on the log-odds scale) or 1 (for an odds ratio), the parameter is not statistically significant. In this data set, the confidence intervals for study time (-0.8211, -0.1919), sex (0.2245, 1.1963) and age (0.0097, 0.4061) do not include 0, and the corresponding odds-ratio intervals do not include 1 –> we can be 95% confident that these variables have a true effect on being in the “high_use” category, and thus they are statistically significant.
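Since the printed intervals are on the log-odds scale, exponentiating them moves the check to the odds-ratio scale, where the “no effect” value is 1; for example for study time:

```r
# 95% CI of the studytime coefficient (log-odds scale, from the output above)
ci_studytime <- c(-0.8211, -0.1919)

# On the odds-ratio scale the whole interval stays below 1
exp(ci_studytime)   # roughly (0.44, 0.83)
```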

Predictive power of the model

R code

# Make predictions using the logistic regression model
predictions <- predict(logistic_model, type = "response")

# Convert predicted probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

# Create a data frame with actual and predicted values
prediction_df <- data.frame(Actual = alc$high_use, Predicted = predicted_classes)

# 2x2 Cross-tabulation
confusion_matrix <- table(prediction_df$Actual, prediction_df$Predicted)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
##        
##           0   1
##   FALSE 244  15
##   TRUE   92  19
# Calculate training error (proportion of inaccurately classified individuals)
training_error <- mean(prediction_df$Actual != prediction_df$Predicted)
print(paste("Training Error:", training_error))
## [1] "Training Error: 0.289189189189189"
# Display a graphic visualizing actual values and predictions
library(ggplot2)

# Scatter plot of age versus studytime
age_studytime_scatter_plot <- ggplot(prediction_df, aes(x = alc$age, y = alc$studytime, color = factor(Predicted))) +
  geom_point() +
  labs(title = "Scatter Plot of Age vs. Study Time with Predictions",
       x = "Age",
       y = "Study Time",
       color = "Predicted") +
  theme_minimal()

print(age_studytime_scatter_plot)

# Remove rows with missing values in the 'high_use' column
alc_no_missing <- alc[complete.cases(alc$high_use), ]

# Compare with a simple guessing strategy: always predict the majority class
majority_class <- names(which.max(table(alc_no_missing$high_use)))  # "FALSE"
simple_guessing <- rep(as.logical(majority_class), nrow(alc_no_missing))

# Calculate training error for the simple guessing strategy
simple_guessing_error <- mean(alc_no_missing$high_use != simple_guessing)
print(paste("Simple Guessing Error:", simple_guessing_error))
## [1] "Simple Guessing Error: 0.3"
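The training error above can be double-checked by hand from the confusion matrix: the off-diagonal cells (15 false positives and 92 false negatives) are the misclassified observations.

```r
# Confusion matrix values copied from the output above
cm <- matrix(c(244, 92, 15, 19), nrow = 2,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("0", "1")))

misclassified <- cm["FALSE", "1"] + cm["TRUE", "0"]   # 15 + 92 = 107
misclassified / sum(cm)                               # 107 / 370, about 0.2892
```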

10-fold cross-validation

R code

library(boot)

# Define the logistic regression model
logistic_model <- glm(high_use ~ studytime + sex + age, data = alc, family = "binomial")

# A manual fold-evaluation function (for illustration only; not used below, since cv.glm handles the folds automatically)
logistic_model_func <- function(data, indices) {
  train_data <- data[indices, ]
  test_data <- data[-indices, ]
  
  # Fit the logistic regression model on the training data
  model <- glm(high_use ~ studytime + sex + age, data = train_data, family = "binomial")
  
  # Make predictions on the test data
  predictions <- predict(model, test_data, type = "response")
  
  # Convert predicted probabilities to binary predictions
  predicted_classes <- ifelse(predictions > 0.5, 1, 0)
  
  # Return the predicted classes for the test set
  return(predicted_classes)
}

# Perform 10-fold cross-validation
cv_results <- cv.glm(data = alc, glmfit = logistic_model, K = 10)

# cv.glm returns two delta values: the raw cross-validation estimate of
# prediction error and a bias-corrected version (the default cost is mean
# squared error, not the misclassification rate); average them
cv_error <- mean(cv_results$delta)

# Print the cross-validation error
print(paste("10-Fold Cross-Validation Error:", cv_error))
## [1] "10-Fold Cross-Validation Error: 0.19665616037216"
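Note that cv.glm’s default cost function is the average squared error between the responses and the predicted probabilities, not the misclassification rate. To cross-validate the classification error instead, a cost function can be passed explicitly (a sketch; the commented-out call reuses alc and logistic_model defined above):

```r
# Misclassification cost: share of observations on the wrong side of 0.5
cost_fn <- function(actual, predicted_prob) {
  mean(abs(actual - predicted_prob) > 0.5)
}

# Applied to cross-validation (commented out here):
# cv_mis <- cv.glm(data = alc, glmfit = logistic_model, cost = cost_fn, K = 10)
# cv_mis$delta[1]

# Quick check on toy vectors: two of the four predictions are wrong
cost_fn(c(1, 0, 1, 0), c(0.9, 0.1, 0.2, 0.8))   # 0.5
```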

Assignment 4: Data Analysis exercise

date()
## [1] "Mon Dec 11 19:44:45 2023"

The Boston data set is available in R from the MASS package. It contains information on housing values in the suburbs of Boston. Each row represents a different town, and each column is a variable describing various aspects of the towns.

Reading data into R

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the structure of the data set
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# explore the dimensions of the data set
dim(Boston)
## [1] 506  14

The Boston data set has 506 observations (rows) and 14 variables (columns). All variables are numeric; chas and rad are integers, and chas is a binary indicator (1 if the tract bounds the Charles River, 0 otherwise).

Graphical overview of the data

The following gives a graphical overview and summaries of the variables in the data. Some distributions of the variables and the relationships between them are also discussed.

# Look at the summaries of the variables in the data set
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot matrix of the variables
pairs(Boston)

# Access tidyr and corrplot libraries
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- Boston %>% 
  cor() %>% 
  round(2)

# print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The correlation matrix shows the pairwise correlations between variables, indicating the strength and direction of their linear relationships. Positive values indicate a positive linear relationship, negative values a negative one, and values close to zero suggest a weak or no linear relationship. Since eyeballing a correlation matrix is challenging with larger data sets, including this one, I use the corrplot visualization to interpret the correlations: strong correlations (close to -1 or 1) get darker colors (red near -1, blue near 1), and the size of each circle also indicates the magnitude of the correlation.

Some of the variables are highly correlated, either negatively or positively. For example, the percentage of lower-status population (lstat) and the median value of owner-occupied homes in $1000s (medv) are strongly negatively correlated, while the full-value property-tax rate per $10,000 (tax) and the proportion of non-retail business acres per town (indus) are strongly positively correlated.

Because the variables are on very different scales and not all normally distributed, they need to be standardized before the analysis.
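For reference, scale() standardizes each column to mean 0 and standard deviation 1; a quick sketch of the equivalence on a toy vector:

```r
# scale() computes (x - mean(x)) / sd(x) for each column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.vector(scale(x)))  # TRUE
```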

Standardizing the data set

The data set is standardized as follows:

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Categorize the data by crime rate

Then I’ll create a categorical variable from crim (per capita crime rate by town) to use as the factor variable, cutting it at the quantiles so that low, medium and high crime rates fall into their own categories. After this I’ll remove the original crim variable from the data set and replace it with the new one.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Now the data is divided into train and test sets, with 80% of the observations going to the train set. We also save the correct classes from the test set and then remove the crime variable from it.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
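One caveat: sample() draws a different random split every time the document is knitted. Fixing the seed before sampling makes the split, and everything downstream of it, reproducible; a sketch (the seed value is arbitrary):

```r
# fix the RNG state so the 80/20 split is identical on every knit
set.seed(2023)
ind <- sample(n, size = n * 0.8)
```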

Linear discriminant analysis and the LDA (bi)plot

The categorical crime rate is used as the target variable in the linear discriminant analysis. All the other variables in the data set are used as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2599010 0.2549505 0.2425743 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9130483 -0.8950041 -0.151805591 -0.8615010  0.4291075 -0.8793412
## med_low  -0.1005293 -0.2982934 -0.009855719 -0.5548611 -0.1619742 -0.3170439
## med_high -0.3773415  0.1638680  0.301035044  0.3483259  0.2413675  0.4155388
## high     -0.4872402  1.0171960 -0.031282114  1.1029298 -0.3896971  0.8153299
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8270651 -0.6888949 -0.7226799 -0.37632230  0.3710387 -0.73771431
## med_low   0.3471767 -0.5476413 -0.4618963 -0.06568307  0.3164061 -0.12329985
## med_high -0.3511543 -0.3753022 -0.2915106 -0.30324295  0.1370716 -0.02821847
## high     -0.8602633  1.6373367  1.5134896  0.77985517 -0.7695333  0.91500117
##                  medv
## low       0.490266324
## med_low  -0.007709118
## med_high  0.273324689
## high     -0.688898102
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.074552946  0.84966455 -0.91589354
## indus    0.018170689 -0.34323897  0.24092745
## chas    -0.080851312 -0.06197874  0.04358941
## nox      0.468065359 -0.68162882 -1.38284512
## rm      -0.073408090 -0.17123796 -0.26652658
## age      0.298177360 -0.37306911 -0.05579778
## dis      0.021937737 -0.51239207  0.13218308
## rad      2.973023616  0.76586908 -0.01464469
## tax     -0.009623886  0.08188377  0.55558121
## ptratio  0.107331709  0.09803687 -0.32259536
## black   -0.133389954 -0.01831856  0.11223747
## lstat    0.244326768 -0.30774150  0.33087817
## medv     0.237565655 -0.48836623 -0.17810951
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9437 0.0419 0.0144
# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results 
plot(lda.fit, dimen=2, col=classes)

Next I’d save the crime categories from the test set, remove the categorical crime variable from the test data set, predict the crime rate in the test set using the model fitted on the train set, and cross-tabulate the predictions with the correct classes. However, this code gave an error that I was not able to fix in time.

Code to save the correct classes from test data: correct_classes <- test$crime

Code to remove the crime variable from test data: test <- dplyr::select(test, -crime)

Code to predict classes with test data: lda.pred <- predict(lda.fit, newdata = test)

Code to cross tabulate the results: table(correct = correct_classes, predicted = lda.pred$class)
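One likely cause of the error, judging from the code above, is that correct_classes was already saved and crime already removed from test right after the split, so running dplyr::select(test, -crime) a second time fails because the column no longer exists. Under that assumption, only the last two steps are needed:

```r
# 'crime' was already dropped from the test set above, so predict directly
lda.pred <- predict(lda.fit, newdata = test)

# cross-tabulate the correct classes against the predictions
table(correct = correct_classes, predicted = lda.pred$class)
```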

Reload Boston dataset

# reload the Boston data, re-standardize it and compute the
# Euclidean distance matrix between the observations
data("Boston")
boston_scaled <- scale(Boston)
boston_euk_dist <- dist(boston_scaled)

Let’s run k-means clustering with 3 clusters and visualize the result with a few relevant variables.

# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled[,6:10], col = km$cluster)

That doesn’t look that good, so let’s try 2 clusters instead of 3.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled[,6:10], col = km$cluster)
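Rather than guessing the number of clusters, one can compute the total within-cluster sum of squares (WCSS) for a range of k and look for the point where it drops sharply; a sketch (the variable names are my own):

```r
# total WCSS for k = 1..10; the seed makes kmeans() reproducible
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)

# the "elbow", where the curve bends sharply, suggests a good k
plot(1:k_max, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```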


Assignment 5: Data Analysis excercise

date()
## [1] "Mon Dec 11 19:44:47 2023"

Reading data into R

# Load required libraries
library(dplyr)
library(readr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Specify the file path into CSV file
file_path <- "/Users/minmaunu/IODS-project/data/human.csv"

# Read the CSV file into a data frame
human <- read_csv(file_path)
## Rows: 148 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu_Ratio, LF_Ratio, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth, Pa...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Explore the datasets

# See the structure of the data.
str(human)
## spc_tbl_ [148 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country  : chr [1:148] "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ Edu_Ratio: num [1:148] 1.007 0.997 0.983 0.989 0.969 ...
##  $ LF_Ratio : num [1:148] 0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num [1:148] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num [1:148] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : num [1:148] 64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : num [1:148] 4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num [1:148] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num [1:148] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Edu_Ratio = col_double(),
##   ..   LF_Ratio = col_double(),
##   ..   Edu.Exp = col_double(),
##   ..   Life.Exp = col_double(),
##   ..   GNI = col_double(),
##   ..   Mat.Mor = col_double(),
##   ..   Ado.Birth = col_double(),
##   ..   Parli.F = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# See the dimension of the data.
dim(human)
## [1] 148   9

Overview of the data

# move the country names to rownames
library(tibble)
human_ <- column_to_rownames(human, "Country")

# Access GGally
library(GGally)

# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)

# Check the summary
summary(human_)
##    Edu_Ratio         LF_Ratio         Edu.Exp         Life.Exp    
##  Min.   :0.1980   Min.   :0.1857   Min.   : 7.00   Min.   :49.00  
##  1st Qu.:0.8097   1st Qu.:0.5971   1st Qu.:11.57   1st Qu.:68.38  
##  Median :0.9444   Median :0.7504   Median :13.55   Median :74.35  
##  Mean   :0.8766   Mean   :0.7006   Mean   :13.42   Mean   :72.44  
##  3rd Qu.:0.9990   3rd Qu.:0.8385   3rd Qu.:15.32   3rd Qu.:77.45  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor        Ado.Birth         Parli.F     
##  Min.   :   680   Min.   :  1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  5190   1st Qu.: 11.0   1st Qu.: 12.18   1st Qu.:12.15  
##  Median : 12408   Median : 45.5   Median : 31.30   Median :19.50  
##  Mean   : 18402   Mean   :120.9   Mean   : 43.72   Mean   :20.95  
##  3rd Qu.: 25350   3rd Qu.:170.0   3rd Qu.: 69.05   3rd Qu.:27.82  
##  Max.   :123124   Max.   :730.0   Max.   :175.60   Max.   :57.50

Most variables seem fairly normally distributed, except GNI, maternal mortality and adolescent birth rate, which are skewed. Many variables correlate with each other: for example, expected years of education and life expectancy have a strong positive correlation, whereas maternal mortality and life expectancy have a strong negative correlation.

PCA on the non-standardized data

# perform PCA
pca_human <- prcomp(human_)
s <-summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("darkblue", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

GNI has a much larger variance than the other variables, so in the unstandardized PCA it dominates PC1, which appears to explain essentially 100% of the variation. The data therefore needs to be standardized.
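This effect is easy to demonstrate: when one column's variance dwarfs the others, unscaled PCA loads almost entirely on that column. A toy sketch (made-up data, not the human data):

```r
set.seed(1)
toy <- data.frame(big   = rnorm(100, sd = 1000),  # huge scale, like GNI
                  small = rnorm(100, sd = 1))

# proportion of variance per PC, unscaled vs scaled
round(summary(prcomp(toy))$importance[2, ], 3)                 # PC1 takes ~100%
round(summary(prcomp(toy, scale. = TRUE))$importance[2, ], 3)  # variance is shared
```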

PCA on the standardized data

# standardize the human data
human_scaled <- scale(human_)
pca_human <- prcomp(human_scaled)

#print out the summary 
s <- summary(pca_human)

# rounded percentanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 51.7 16.3 10.0  7.9  6.2  3.8  2.9  1.2
# New PCA with standardized data
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("darkblue", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The biplot looks much better now after scaling! The PC1 axis captures 51.7% of the variation and is driven by education, life expectancy and GNI, with which maternal mortality and adolescent birth rate (arrows to the right) correlate negatively (arrows to the left). The PC2 axis, capturing 16.3% of the variation, reflects women’s labour force participation and representation in parliament.

Explore Tea Data Set

# Load the data into R 
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Load needed libraries
library(FactoMineR)
library(tidyr)
library(ggplot2)

# Explore the data set 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
View(tea)

The tea data set has 300 observations and 36 variables and contains data on tea consumption and drinking habits. The variable age is an integer, while all the others are factors.

# Visualize the data set
gather(tea) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))
## Warning: attributes are not identical across measure variables; they will be
## dropped

Unfortunately I don’t know why this figure looks like this. Not happy with it!
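The warning itself is harmless: gather() coerces the factor columns to character and drops their level attributes. The crowded look more likely comes from cramming the numeric age variable and 35 factors into 36 facets. One way to get a cleaner figure, assuming only the factor variables are of interest, is:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# keep only the factor columns and convert them to character before pivoting,
# which also silences the attribute warning
tea %>%
  select(where(is.factor)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap("name", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))
```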

Multiple Correspondence Analysis (MCA) with selected variables

# column names to keep in the dataset
keep_columns <- c("Tea", "frequency", "sex", "work", "age_Q")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# look at the summaries and structure of the data
summary(tea_time)
##         Tea            frequency   sex           work       age_Q   
##  black    : 74   +2/day     :127   F:178   Not.work:213   +60  :38  
##  Earl Grey:193   1 to 2/week: 44   M:122   work    : 87   15-24:92  
##  green    : 33   1/day      : 95                          25-34:69  
##                  3 to 6/week: 34                          35-44:40  
##                                                           45-59:61
str(tea_time)
## 'data.frame':    300 obs. of  5 variables:
##  $ Tea      : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ frequency: Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ sex      : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ work     : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ age_Q    : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Based on the selected variables, 15-24-year-olds drink the most tea. Most respondents drink tea more than twice a day, with women drinking slightly more often than men. Earl Grey is more popular than green or black tea. People are more than twice as likely to drink tea outside the workplace as at work.
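The MCA itself is not run above; with the tea_time subset (all five columns are factors) it could be fitted and visualized along these lines:

```r
library(FactoMineR)

# multiple correspondence analysis on the selected factor variables
mca <- MCA(tea_time, graph = FALSE)

# summary of the model: eigenvalues, individuals, categories
summary(mca)

# variable biplot of the first two MCA dimensions
plot(mca, invisible = c("ind"), habillage = "quali")
```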

Assignment 6: Data Analysis excercise

date()
## [1] "Mon Dec 11 19:44:51 2023"

RATS

Reading data into R

# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ purrr     1.0.2
## ✔ lubridate 1.9.3     ✔ stringr   1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ MASS::select()  masks dplyr::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
# Read the data into R 
RATSL <- read.table("/Users/minmaunu/IODS-project/data/RATSL.txt", header = TRUE)

# Change the categorical variables to factors
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)

# Take a glimpse on the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

The data has 176 rows and 5 columns. Everything seems to be in order and the data is ready for analysis.

Plot non-standardized data

Let’s draw a line plot of the rat weights in diet groups 1-3

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Rats that start at a high weight also end at a high weight (a tracking effect), so the data should be standardized.

Plot standardized data

RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
  ungroup()

ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$stdweight), max(RATSL$stdweight)), name = "standardized weight")

Looks better. Next let’s look at the response to the diets.

Response to the diets

n <- RATSL$Time %>% unique() %>% length()
RATS_trmt <- RATSL %>% group_by(Group, Time) %>% summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n))
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATS_trmt)
## Rows: 33
## Columns: 4
## Groups: Group [3]
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30…
ggplot(RATSL, aes(x= as.factor(Time), y=Weight, fill=Group)) +
  geom_boxplot() +
  theme(legend.position = c(0.8,0.4), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
  scale_y_continuous(name = "Weight") + 
  scale_x_discrete(name = "Time") + 
  scale_fill_manual(values=c("pink", "forestgreen", "lightblue")) +
  ggtitle("Rat weight over time according to different diets ")

On average, diet 1 results in lower weight than diets 2 and 3. Diet 2 seems to have the largest variation between rats.
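The RATS_trmt summary computed above (group means and standard errors at each time point) is not actually used in the boxplot; it can be drawn directly as a mean profile plot with error bars, for example:

```r
ggplot(RATS_trmt, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size = 3) +
  # error bars show the group mean +/- one standard error
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
  theme(legend.position = c(0.9, 0.4)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
```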

Next let’s look at the outliers in the data.

Outlier in the data

RATSL8 <- RATSL %>% 
  group_by(Group, ID) %>%
  summarize(mean = mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSL8, aes(x=Group, y=mean)) + 
  geom_boxplot() + 
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") + 
  theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
  scale_y_continuous(name = "mean(Weight) per group")

An outlier can be found in each group, but the one in group 2 clearly differs from the rest of the group. Let’s remove it and draw the boxplot again!

RATSL8S1 <- filter(RATSL8, (Group=="2" & mean<500) | Group!="2")

ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight) per group")

BPRSL

Reading data into R

# Read the data 
BPRSL <- read.table("/Users/minmaunu/IODS-project/data/BPRSL.txt", header = TRUE)

# Change the categorical variables to factors
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)

# Take a glimpse on the data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

The data has 360 rows and 5 columns. Everything seems to be in order and the data is ready for analysis.

Plot the data with a line plot

ggplot(BPRSL, aes(x = week, y = bprs, fill=subject, color = treatment)) +
  geom_line(aes(linetype = treatment)) +
  theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) + 
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2), expand=c(0,0)) + 
  scale_y_continuous(name = "bprs") +
  scale_color_manual(values=c("pink", "forestgreen")) +
  ggtitle("Symptoms in patients over time")

It’s quite a busy plot, but just by eyeballing it, it seems that with treatment 1 the symptoms have slightly decreased over time, whereas with treatment 2 there is more variation towards the end.

Next let’s fit a Random Intercept Model.

Random Intercept Model

# Load required libraries
library(Matrix)
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Now let’s fit a random intercept and random slope model and compare it to the random intercept model.

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s also try an interaction between week and treatment and compare again.

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
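The same arithmetic applies here, and the AIC values can be reproduced from AIC = 2·npar − 2·logLik (small discrepancies come from the rounding of logLik in the printout):

```r
# Interaction test: chi-squared from the deviances printed above
pchisq(2731.4 - 2728.3, df = 1, lower.tail = FALSE)  # ~0.08, not significant at 0.05

# AIC = 2*npar - 2*logLik for the random intercept and slope model
2 * 7 - 2 * (-1365.7)                                # 2745.4, matching the table
```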

Based on the likelihood ratio tests, the random slope clearly improves the fit (p ≈ 0.026), while the week-by-treatment interaction does not (p ≈ 0.075), so the random intercept and random slope model without the interaction (BPRS_ref1) is the best of the three. Let’s plot the observed and fitted values side by side:

library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject, color = treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Observed bprs") +
  scale_color_manual(values = c("pink", "forestgreen")) +
  ggtitle("Symptoms in patients over time") +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill = FALSE),
        legend.position = "bottom")
# Add the fitted values from the random intercept and slope model to the data
Fitted <- fitted(BPRS_ref1)
BPRSL <- mutate(BPRSL, fitted = Fitted)

p2 <- ggplot(BPRSL, aes(x = week, y = fitted, group = subject, color = treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  scale_color_manual(values = c("pink", "forestgreen")) +
  ggtitle(" ") +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill = FALSE),
        legend.position = "bottom")


# p1 and p2 side by side
grid.arrange(p1, p2, ncol=2)

Based on the fitted model, we can see that the treatment effect is small relative to its standard error, so the two treatments appear to be about equally effective in reducing the symptoms.

Done!